Goto

Collaborating Authors

 robust speech recognition


A Cocktail-Party Benchmark: Multi-Modal dataset and Comparative Evaluation Results

arXiv.org Artificial Intelligence

We introduce the task of Multi-Modal Context-Aware Recognition (MCoRec) in the ninth CHiME Challenge, which addresses the cocktail-party problem of overlapping conversations in a single-room setting using audio, visual, and contextual cues. MCoRec captures natural multi-party conversations where the recordings focus on unscripted, casual group chats, leading to extreme speech overlap of up to 100% and highly fragmented conversational turns. The task requires systems to answer the question "Who speaks when, what, and with whom?" by jointly transcribing each speaker's speech and clustering them into their respective conversations from audio-visual recordings. Audio-only baselines exceed 100% word error rate, whereas incorporating visual cues yields substantial 50% improvements, highlighting the importance of multi-modality. In this manuscript, we present the motivation behind the task, outline the data collection process, and report the baseline systems developed for the MCoRec.


Towards Robust Speech Recognition for Jamaican Patois Music Transcription

arXiv.org Artificial Intelligence

Although Jamaican Patois is a widely spoken language, current speech recognition systems perform poorly on Patois music, producing inaccurate captions that limit accessibility and hinder downstream applications. In this work, we take a data-centric approach to this problem by curating more than 40 hours of manually transcribed Patois music. We use this dataset to fine-tune state-of-the-art automatic speech recognition (ASR) models, and use the results to develop scaling laws for the performance of Whisper models on Jamaican Patois audio. We hope that this work will have a positive impact on the accessibility of Jamaican Patois music and the future of Jamaican Patois language modeling.


MoHAVE: Mixture of Hierarchical Audio-Visual Experts for Robust Speech Recognition

arXiv.org Artificial Intelligence

Audio-visual speech recognition (AVSR) has become critical for enhancing speech recognition in noisy environments by integrating both auditory and visual modalities. However, existing AVSR systems struggle to scale up without compromising computational efficiency. In this study, we introduce MoHAVE (Mixture of Hierarchical Audio-Visual Experts), a novel robust AVSR framework designed to address these scalability constraints. By leveraging a Mixture-of-Experts (MoE) architecture, MoHAVE activates modality-specific expert groups, ensuring dynamic adaptation to various audio-visual inputs with minimal computational overhead. Key contributions of MoHAVE include: (1) a sparse MoE framework that efficiently scales AVSR model capacity, (2) a hierarchical gating mechanism that dynamically utilizes the expert groups based on input context, enhancing adaptability and robustness, and (3) remarkable performance across robust AVSR benchmarks, including LRS3 and MuAViC transcription and translation tasks, setting a new standard for scalable speech recognition systems.


Noise Suppression Based on Neurophysiologically-motivated SNR Estimation for Robust Speech Recognition

Neural Information Processing Systems

For SNR-estimation, the input signal is transformed into so-called Amplitude Modulation Spectrograms (AMS), which rep(cid:173) resent both spectral and temporal characteristics of the respective analysis frame, and which imitate the representation of modula(cid:173) tion frequencies in higher stages of the mammalian auditory sys(cid:173) tem. A neural network is used to analyse AMS patterns generated from noisy speech and estimates the local SNR. Noise suppres(cid:173) sion is achieved by attenuating frequency channels according to their SNR. The noise suppression algorithm is evaluated in speaker(cid:173) independent digit recognition experiments and compared to noise suppression by Spectral Subtraction.


Combining ICA and Top-Down Attention for Robust Speech Recognition

Neural Information Processing Systems

We present an algorithm which compensates for the mismatches between characteristics of real-world problems and assumptions of independent component analysis algorithm. To provide additional information to the ICA network, we incorporate top-down selec(cid:173) tive attention. An MLP classifier is added to the separated signal channel and the error of the classifier is backpropagated to the ICA network. This backpropagation process results in estimation of expected ICA output signal for the top-down attention. Then, the unmixing matrix is retrained according to a new cost function representing the backpropagated error as well as independence.


ALGONQUIN - Learning Dynamic Noise Models From Noisy Speech for Robust Speech Recognition

Neural Information Processing Systems

A challenging, unsolved problem in the speech recognition com(cid:173) munity is recognizing speech signals that are corrupted by loud, highly nonstationary noise. One approach to noisy speech recog(cid:173) nition is to automatically remove the noise from the cepstrum se(cid:173) quence before feeding it in to a clean speech recognizer. In previous work published in Eurospeech, we showed how a probability model trained on clean speech and a separate probability model trained on noise could be combined for the purpose of estimating the noise(cid:173) free speech from the noisy speech. We showed how an iterative 2nd order vector Taylor series approximation could be used for prob(cid:173) abilistic inference in this model. In many circumstances, it is not possible to obtain examples of noise without speech.


Audio-Visual Efficient Conformer for Robust Speech Recognition

arXiv.org Artificial Intelligence

End-to-end Automatic Speech Recognition (ASR) systems based on neural networks have seen large improvements in recent years. The availability of large scale hand-labeled datasets and sufficient computing resources made it possible to train powerful deep neural networks, reaching very low Word Error Rate (WER) on academic benchmarks. However, despite impressive performance on clean audio samples, a drop of performance is often observed on noisy speech. In this work, we propose to improve the noise robustness of the recently proposed Efficient Conformer Connectionist Temporal Classification (CTC)-based architecture by processing both audio and visual modalities. We improve previous lip reading methods using an Efficient Conformer back-end on top of a ResNet-18 visual front-end and by adding intermediate CTC losses between blocks. We condition intermediate block features on early predictions using Inter CTC residual modules to relax the conditional independence assumption of CTC-based models. We also replace the Efficient Conformer grouped attention by a more efficient and simpler attention mechanism that we call patch attention. We experiment with publicly available Lip Reading Sentences 2 (LRS2) and Lip Reading Sentences 3 (LRS3) datasets. Our experiments show that using audio and visual modalities allows to better recognize speech in the presence of environmental noise and significantly accelerate training, reaching lower WER with 4 times less training steps. Our Audio-Visual Efficient Conformer (AVEC) model achieves state-of-the-art performance, reaching WER of 2.3% and 1.8% on LRS2 and LRS3 test sets. Code and pretrained models are available at https://github.com/burchim/AVEC.


Robust Speech Recognition via Large-Scale Weak Supervision

arXiv.org Artificial Intelligence

We study the capabilities of speech processing systems trained simply to predict large amounts of transcripts of audio on the internet. When scaled to 680,000 hours of multilingual and multitask supervision, the resulting models generalize well to standard benchmarks and are often competitive with prior fully supervised results but in a zero-shot transfer setting without the need for any fine-tuning. When compared to humans, the models approach their accuracy and robustness. We are releasing models and inference code to serve as a foundation for further work on robust speech processing.


Speech-enhanced and Noise-aware Networks for Robust Speech Recognition

arXiv.org Artificial Intelligence

Compensation for channel mismatch and noise interference is essential for robust automatic speech recognition. Enhanced speech has been introduced into the multi-condition training of acoustic models to improve their generalization ability. In this paper, a noise-aware training framework based on two cascaded neural structures is proposed to jointly optimize speech enhancement and speech recognition. The feature enhancement module is composed of a multi-task autoencoder, where noisy speech is decomposed into clean speech and noise. By concatenating its enhanced, noise-aware, and noisy features for each frame, the acoustic-modeling module maps each feature-augmented frame into a triphone state by optimizing the lattice-free maximum mutual information and cross entropy between the predicted and actual state sequences. On top of the factorized time delay neural network (TDNN-F) and its convolutional variant (CNN-TDNNF), both with SpecAug, the two proposed systems achieve word error rate (WER) of 3.90% and 3.55%, respectively, on the Aurora-4 task. Compared with the best existing systems that use bigram and trigram language models for decoding, the proposed CNN-TDNNF-based system achieves a relative WER reduction of 15.20% and 33.53%, respectively. In addition, the proposed CNN-TDNNF-based system also outperforms the baseline CNN-TDNNF system on the AMI task.


Optimized Power Normalized Cepstral Coefficients towards Robust Deep Speaker Verification

arXiv.org Artificial Intelligence

After their introduction to robust speech recognition, power normalized cepstral coefficient (PNCC) features were successfully adopted to other tasks, including speaker verification. However, as a feature extractor with long-term operations on the power spectrogram, its temporal processing and amplitude scaling steps dedicated on environmental compensation may be redundant. Further, they might suppress intrinsic speaker variations that are useful for speaker verification based on deep neural networks (DNN). Therefore, in this study, we revisit and optimize PNCCs by ablating its medium-time processor and by introducing channel energy normalization. Experimental results with a DNN-based speaker verification system indicate substantial improvement over baseline PNCCs on both in-domain and cross-domain scenarios, reflected by relatively 5.8% and 61.2% maximum lower equal error rate on VoxCeleb1 and VoxMovies, respectively.